Managing the hierarchical organization of data is starting to play a key role in the ...

Understanding distributed applications is a tedious and difficult task. Visualizations based
on process-time diagrams are often used to obtain a better understanding of the 
execution of the application. The visualization tool we use is Poet, an event tracer 
developed at the University of Waterloo. However, these diagrams are often very complex 
and do not provide the user with the desired overview of the application. In our ...

Clustering is the unsupervised classification of patterns (observations, data items, or 
feature vectors) into groups (clusters). The clustering problem has been addressed in 
many contexts and by researchers in many disciplines; this reflects its broad appeal and 
usefulness as one of the steps in exploratory data analysis. However, clustering is a 
difficult problem combinatorially, and differences in assumptions and contexts in different
communities have made the transfer of useful generic co ...

Full text available: pdf(323.72 KB) Additional Information: full citation, abstract, references, index terms

In the recent years, the Web has been rapidly "deepened" with the prevalence of
databases online. On this deep Web, many sources are structured by providing
structured query interfaces and results. Organizing such structured sources into a domain
hierarchy is one of the critical steps toward the integration of heterogeneous Web
sources. We observe that, for structured Web sources, query schemas (i.e.,
attributes in query interfaces) are discriminative representative ...

Publisher: Springer-Verlag New York, Inc.

Full text available: pdf(262.85 KB) Additional Information: full citation, abstract, citings, index terms

The requirements for effective search and management of the WWW are stronger than 
ever. Currently Web documents are classified based on their content not taking into 
account the fact that these documents are connected to each other by links. We claim 
that a page's classification is enriched by the detection of its incoming links' semantics. 
This would enable effective browsing and enhance the validity of search results in the 
WWW context. Another aspect that is underaddressed and str ... 

Keywords: Document clustering, Link analysis, Link management, Semantics, Similarity
measure, World Wide Web

It is crucial in many information systems to organize short text segments, such as
keywords in documents and queries from users, into a well-formed taxonomy. In this
article, we address the problem of taxonomy generation for diverse text segments with a 
general and practical approach that uses the Web as an additional knowledge source. 
Unlike long documents, short text segments typically do not contain enough information 
to extract reliable features. This work investigates the possibilities of u ...

Keywords: Taxonomy generation, hierarchical clustering, partitioning, search-result
snippet, text data mining, text segment

Face recognition: A literature survey

W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld 

December 2003 ACM Computing Surveys (CSUR), Volume 35 issue 4 

Publisher: ACM Press 

Full text available: pdf(4.28 MB) Additional Information: full citation, abstract, references, citings, index terms

As one of the most successful applications of image analysis and understanding, face 
recognition has recently received significant attention, especially during the past several 
years. At least two reasons account for this trend: the first is the wide range of 
commercial and law enforcement applications, and the second is the availability of 
feasible technologies after 30 years of research. Even though current machine recognition 
systems have reached a certain level of maturity, their success is ... 

Keywords: Face recognition, person identification 



Improving statistical language model performance with automatically generated word 
John G. McMahon, Francis J. Smith

June 1996 Computational Linguistics, volume 22 issue 2 

Publisher: MIT Press 

Full text available: pdf(2.02 MB) Additional Information: full citation, abstract, references, citings

An automatic word-classification system has been designed that uses word unigram and
bigram frequency statistics to implement a binary top-down form of word clustering and 
employs an average class mutual information metric. Words are represented as structural 
tags— n-bit numbers the most significant bit-patterns of which incorporate class 
information. The classification system has revealed some of the lexical structure of 
English, as well as some phonemic and semantic structure. The syst ... 

Automated techniques for managing collections: Machine learning for information 
architecture in a large governmental website

Miles Efron, Jonathan Elsas, Gary Marchionini, Junliang Zhang 

June 2004 Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries 
Publisher: ACM Press 

Full text available: pdf(1.49 MB) Additional Information: full citation, abstract, references, index terms

This paper describes ongoing research into the application of machine learning techniques 
for improving access to governmental information in complex digital libraries. Under the 
auspices of the GovStat Project, our goal is to identify a small number of semantically 
valid concepts that adequately span the intellectual domain of a collection. The goal of
this discovery is twofold. First we desire a practical aid for information architects. Second, 
automatically derived document-concept relations ... 

Keywords: information architecture, interface design, machine learning 

Web personalization is the process of customizing a Web site to the needs of each specific 
user or set of users, taking advantage of the knowledge acquired through the analysis of 
the user's navigational behavior. Integrating usage data with content, structure or user 
profile data enhances the results of the personalization process. In this paper, we present 
SEWeP, a system that makes use of both the usage logs and the semantics of a Web 
site's content in order to personalize it. Web content is ... 

Keywords: Web mining, Web personalization, concept hierarchies, semantic annotation 
of Web content 



Hierarchical file organization and its application to similar-string matching
Tetsuro Ito, Makoto Kizawa

September 1983 ACM Transactions on Database Systems (TODS), volume 8 issue 3 
Publisher: ACM Press

Full text available: pdf(1.54 MB) Additional Information: full citation, abstract, references, citings, index terms

The automatic correction of misspelled inputs is discussed from a viewpoint of similar-string
matching. First a hierarchical file organization based on a linear ordering of records
is presented for retrieving records highly similar to any input query. Then the spelling 
problem is attacked by constructing a hierarchical file for a set of strings in a dictionary of 
English words. The spelling correction steps proceed as follows: (1) find one of the best- 
match strings which are most similar to ... 

Keywords: best match, file organization, good match, hierarchical clustering, linear 
ordering, office automation, similar-string, similarity, spelling correction, text editor 

information and data management

Publisher: ACM Press 

Full text available: pdf(180.53 KB) Additional Information: full citation, abstract, references, index terms

Hierarchical categorization of documents is a task receiving growing interest due to the
widespread proliferation of topic hierarchies for text documents. The worst problem of 
hierarchical supervised classifiers is their high demand in terms of labeled examples, 
whose amount is related to the number of topics in the taxonomy. Hence, bootstrapping a 
huge hierarchy with a proper set of labeled examples is a critical issue. In this paper, we 
propose some solutions for the bootstrapping problem, imp ... 

Keywords: TaxSOM, constrained clustering, digital libraries, k-means, knowledge 
management, taxonomy bootstrapping process, text categorization, web directories 



Posters: Content-based image retrieval by clustering 
Yixin Chen, James Z. Wang, Robert Krovetz 

November 2003 Proceedings of the 5th ACM SIGMM international workshop on 
Multimedia information retrieval 

Publisher: ACM Press 

Full text available: pdf(658.35 KB) Additional Information: full citation, abstract, references, citings, index terms

In a typical content-based image retrieval (CBIR) system, query results are a set of 
images sorted by feature similarities with respect to the query. However, images with 
high feature similarities to the query may be very different from the query in terms of 
semantics. This is known as the semantic gap. We introduce a novel image retrieval 
scheme, CLUster-based retrieval of images by unsupervised learning (CLUE), which 
tackles the semantic gap problem based on a hypothesis: semantically simil ...

Keywords: content-based image retrieval, image classification, spectral graph clustering, 
unsupervised learning 



Generation and search of clustered files 
G. Salton, A. Wong 

December 1978 ACM Transactions on Database Systems (TODS), volume 3 issue 4 
Publisher: ACM Press 

Full text available: pdf(1.78 MB) Additional Information: full citation, abstract, references, citings, index terms

A classified, or clustered file is one where related, or similar records are grouped into 
classes, or clusters of items in such a way that all items within a cluster are jointly 
retrievable. Clustered files are easily adapted to broad and narrow search strategies, and 
simple file updating methods are available. An inexpensive file clustering method 
applicable to large files is given together with appropriate file search methods. An abstract 
model is then introduced to predict the retrieval ... 

Keywords: automatic classification, cluster searching, clustered files, fast classification, 
file organization, probabilistic models 

Model-based clustering techniques have been widely used and have shown promising
results in many applications involving complex data. This paper presents a unified
framework for probabilistic model-based clustering based on a bipartite graph view of 
data and models that highlights the commonalities and differences among existing model- 
based clustering algorithms. In this view, clusters are represented as probabilistic models 
in a model space that is conceptually separate from the data space. For ... 

17 Named entities 2: Automatic feature thesaurus enrichment: extracting generic terms
from digital gazetteer
Jun Wang, Ning Ge

June 2006 Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries 
JCDL '06 

Publisher: ACM Press 

Full text available: pdf(517.02 KB) Additional Information: full citation, abstract, references, index terms

ADL Gazetteer is a digitalized worldwide gazetteer developed in the Alexandria Digital
Library (ADL) Project, which contains millions of geographic names (placenames). The 
placenames are indexed with type terms from the ADL Feature Type Thesaurus (FTT), a 
hierarchical category scheme. The paper proposes a two-step method to enrich the 
category scheme automatically: to discover frequent generic terms by detecting phase 
boundaries with a mutual information-based method, and to correlate the generi ... 

Keywords: automatic gazetteer updating, correlation analysis, digital gazetteer, generic 
term extraction 

Jinze Liu, Wei Wang, Jiong Yang

August 2004 Proceedings of the tenth ACM SIGKDD international conference on
Knowledge discovery and data mining KDD '04
Publisher: ACM Press 

Full text available: pdf(685.02 KB) Additional Information: full citation, abstract, references, index terms

Traditional clustering is a descriptive task that seeks to identify homogeneous groups of 
objects based on the values of their attributes. While domain knowledge is always the 
best way to justify clustering, few clustering algorithms have ever taken domain knowledge
into consideration. In this paper, the domain knowledge is represented by hierarchical 
ontology. We develop a framework by directly incorporating domain knowledge into 
clustering process, yielding a set of clusters with strong ontolog ... 

Keywords: ontology, subspace clustering, tendency preserving 




19 IR-2 (information retrieval): web information retrieval: A practical web-based
approach to generating topic hierarchy for text segments
Shui-Lung Chuang, Lee-Feng Chien

November 2004 Proceedings of the thirteenth ACM international conference on 
Information and knowledge management CIKM '04 

Publisher: ACM Press 

Full text available: pdf(351.23 KB) Additional Information: full citation, abstract, references, citings, index terms

It is crucial in many information systems to organize short text segments, such as 
keywords in documents and queries from users, into a well-formed topic hierarchy. In this 
paper, we address the problem of generating topic hierarchies for diverse text segments 
with a general and practical approach that uses the Web as an additional knowledge 
source. Unlike long documents, short text segments typically do not contain enough 
information to extract reliable features. This work investigates the p ...

Keywords: clustering, partitioning, search-result snippet, text segment, topic hierarchy 
generation, web data mining 



20 Image and cultural digital libraries: Time as essence for photo browsing through
personal digital libraries

Adrian Graham, Hector Garcia-Molina, Andreas Paepcke, Terry Winograd

July 2002 Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries
Publisher: ACM Press 

Full text available: pdf(3.39 MB) Additional Information: full citation, abstract, references, citings, index terms

We developed two photo browsers for collections with thousands of time-stamped digital 
images. Modern digital cameras record photo shoot times, and semantically related 
photos tend to occur in bursts. Our browsers exploit the timing information to structure 
the collections and to automatically generate meaningful summaries. The browsers differ 
in how users navigate and view the structured collections. We conducted user studies to 
compare the two browsers and an un-summarized image browser. Our r ... 

Keywords: ACDSee, burst identification, image browser, personal digital library, photo
browser, summarization, time-based clustering, time-based navigation 